Pretty obvious answer of K=2 there.
For K=2, I plotted the proportion of each sample’s genome that was assigned to either K1 or K2:
Given the high amounts of missingness in some of the regions, I wanted to make sure that missingness doesn’t predict a sample’s assignment to one of the two populations. The plot below shows the frequency of missing genotypes (y-axis) across samples (x-axis); each sample is colored by the proportion of its genome assigned to K2:
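The per-sample missingness used for that plot can be computed along these lines (a sketch; `geno` here is a toy genotype matrix with `NA` marking missing calls, standing in for the real data):

```r
# geno: toy samples x SNPs genotype matrix; NA marks a missing call
geno <- matrix(c(1, NA, 0, 1,
                 NA, NA, 1, 0,
                 0,  1,  1, 1),
               nrow = 3, byrow = TRUE)

# Proportion of missing genotypes per sample (the y-axis of the plot)
miss <- rowMeans(is.na(geno))
round(miss, 2)  # 0.25 0.50 0.00

# Direct check of the concern: correlate missingness with the K2 proportion
# k2 <- ...  # per-sample K2 proportions from the ADMIXTURE output (not shown)
# cor.test(miss, k2)
```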
## [1] 0.01368808
## [1] 0.009160282
## [1] 0.007722963
Regions of interest on:

- PC1: two hits on chrm2 () and one on chrm11
- PC2: region upstream of dhps on chrm8 and aat1 on chrm6
- PC3: dhps, crt
I used the randomForest package to conduct supervised machine learning, classifying samples as North or South:
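The fit reported below comes from a call of this shape; here is a self-contained sketch with toy data standing in for the real SNP table (the real run used `ntree = 10000` and `mtry = 200`, and the original seed is not recorded here):

```r
library(randomForest)  # assumed installed

set.seed(1)  # illustrative only; see the TO DO about the seed

# Toy stand-in for the real data: one row per sample, SNP genotypes coded
# 0/1/2, and a factor outcome loc2 (North/South)
n <- 120
geno <- data.frame(matrix(rbinom(n * 25, 2, 0.5), nrow = n))
geno$loc2 <- factor(ifelse(rowSums(geno[, 1:5]) + rnorm(n) > 5,
                           "North", "South"))
train <- geno[1:80, ]
test  <- geno[81:120, ]

rf <- randomForest(loc2 ~ ., data = train, importance = TRUE,
                   ntree = 500, mtry = 5, proximity = TRUE)
print(rf)  # reports the OOB error rate and confusion matrix

# Held-out confusion table
predictions.test.data <- predict(rf, newdata = test)
table(test$loc2, predictions.test.data)
```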
##
## Call:
## randomForest(formula = loc2 ~ ., data = train, importance = TRUE, ntree = 10000, mtry = 200, proximity = TRUE, type = classification)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 200
##
## OOB estimate of error rate: 20.45%
## Confusion matrix:
## North South class.error
## North 113 54 0.3233533
## South 18 167 0.0972973
## predictions.test.data
## North South
## North 42 14
## South 5 57
## [[1]]
## [1] 0.921947
We can slightly improve this by adjusting the randomForest parameters, but it’s nothing to write home about.
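One way to adjust those parameters is randomForest’s built-in `tuneRF()`, which steps `mtry` up and down and keeps going while the OOB error improves; a sketch on toy data (whether the original analysis tuned this way is an assumption):

```r
library(randomForest)  # assumed installed

set.seed(1)
# Toy data in place of the real SNP table
n <- 120
geno <- data.frame(matrix(rbinom(n * 25, 2, 0.5), nrow = n))
y <- factor(ifelse(rowSums(geno[, 1:5]) > 5, "North", "South"))

# Double/halve mtry (stepFactor) while OOB error drops by at least `improve`;
# doBest = TRUE refits a forest at the winning mtry
tuned <- tuneRF(x = geno, y = y, ntreeTry = 500,
                stepFactor = 2, improve = 0.01,
                trace = FALSE, plot = FALSE, doBest = TRUE)
print(tuned)
```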
Here are the SNPs associated with the north/south divide, as well as some QC:
##
## Call:
## randomForest(formula = pop ~ ., data = train, importance = TRUE, ntree = 10000, mtry = 200, type = classification)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 200
##
## OOB estimate of error rate: 8.5%
## Confusion matrix:
## 0 1 class.error
## 0 143 29 0.16860465
## 1 18 363 0.04724409
## prediction.test.data
## 0 1
## 0 44 13
## 1 5 122
## [[1]]
## [1] 0.8976378
TO DO: fix seed issue

## Some additional checks
How about treating K as a continuous variable (i.e., using regression instead of classification):
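A sketch of that regression setup on toy data (`K1` here is a simulated admixture proportion in [0, 1]; with a numeric outcome, randomForest fits a regression forest automatically):

```r
library(randomForest)  # assumed installed

set.seed(1)
# Toy stand-in: K1 is a per-sample ancestry proportion in [0, 1]
n <- 100
geno <- data.frame(matrix(rbinom(n * 25, 2, 0.5), nrow = n))
geno$K1 <- pmin(pmax(rowMeans(geno[, 1:5]) / 2 + rnorm(n, sd = 0.1), 0), 1)

# Numeric outcome -> regression forest
rf.reg <- randomForest(K1 ~ ., data = geno, importance = TRUE,
                       ntree = 500, mtry = 5)
print(rf.reg)  # reports mean of squared residuals and % variance explained
```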
##
## Call:
## randomForest(formula = K1 ~ ., data = train, importance = TRUE, ntree = 1000, mtry = 200)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 200
##
## Mean of squared residuals: 0.02553361
## % Var explained: 79.35
## tree variable minimal_depth
## 1 1 chr1_138823 6
## 2 1 chr1_155939 8
## 3 1 chr1_349253 28
## 4 1 chr1_480434 15
## 5 1 chr1_536003 5
## 6 1 chr1_536315 13
## 7 1 chr1_537427 14
## 8 1 chr10_1063886 11
## 9 1 chr10_108733 3
## 10 1 chr10_1128179 14
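The per-tree minimal-depth table above matches the format produced by `min_depth_distribution()` from the randomForestExplainer package; a sketch on the built-in iris data (the package choice is an assumption, since the original code is not shown):

```r
library(randomForest)           # assumed installed
library(randomForestExplainer)  # assumed; provides min_depth_distribution()

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# One row per (tree, variable): the depth at which that variable first splits
md <- min_depth_distribution(rf)
head(md)

# Variables that tend to split near the root are the most informative
sort(tapply(md$minimal_depth, md$variable, mean, na.rm = TRUE))
```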
##
## Call:
## randomForest(formula = loc2 ~ ., data = train, importance = TRUE, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 40
##
## OOB estimate of error rate: 15.11%
## Confusion matrix:
## North South class.error
## North 76 27 0.26213592
## South 7 115 0.05737705
## prediction.test.data
## North South
## North 20 14
## South 4 37
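For reference, overall test-set accuracy can be read straight off a confusion table like the one above (rows = truth, columns = prediction); using the final table’s counts:

```r
# Held-out confusion counts from the final model (copied from the table above)
conf <- matrix(c(20, 4,
                 14, 37),
               nrow = 2,
               dimnames = list(truth = c("North", "South"),
                               predicted = c("North", "South")))

# Accuracy = correct predictions / all predictions
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # (20 + 37) / 75 = 0.76
```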